Observational study;
Causal Inference;
Matching;
R
Welders are exposed to chromium and nickel, substances that can cause inappropriate links between DNA and proteins, which in turn may disrupt gene expression or interfere with DNA replication. Costa, Zhitkovich, and Toniolo [10] measured DNA-protein cross-links in samples of white blood cells from 47 subjects (all male).
Unmatched data for 21 railroad arc welders and 26 potential controls. Covariates are age, race (C=Caucasian, AA=African American), current smoker (Y=yes, N=no). Response is DPC=DNA-protein cross-links in percent in white blood cells. All 47 subjects are male.
Welders — Controls
ID | Age | Race | Smoker | DPC | ID | Age | Race | Smoker | DPC | |
---|---|---|---|---|---|---|---|---|---|---|
1 | 38 | C | N | 1.77 | 1 | 48 | AA | N | 1.08 | |
2 | 44 | C | N | 1.02 | 2 | 63 | C | N | 1.09 | |
3 | 39 | C | Y | 1.44 | 3 | 44 | C | Y | 1.1 | |
4 | 33 | AA | Y | 0.65 | 4 | 40 | C | N | 1.1 | |
5 | 35 | C | Y | 2.08 | 5 | 50 | C | N | 0.93 | |
6 | 39 | C | Y | 0.61 | 6 | 52 | C | N | 1.11 | |
7 | 27 | C | N | 2.86 | 7 | 56 | C | N | 0.98 | |
8 | 43 | C | Y | 4.19 | 8 | 47 | C | N | 2.2 | |
9 | 39 | C | Y | 4.88 | 9 | 38 | C | N | 0.88 | |
10 | 43 | AA | N | 1.08 | 10 | 34 | C | N | 1.55 | |
11 | 41 | C | Y | 2.03 | 11 | 42 | C | N | 0.55 | |
12 | 36 | C | N | 2.81 | 12 | 36 | C | Y | 1.04 | |
13 | 35 | C | N | 0.94 | 13 | 41 | C | N | 1.66 | |
14 | 37 | C | N | 1.43 | 14 | 41 | AA | Y | 1.49 | |
15 | 39 | C | Y | 1.25 | 15 | 31 | AA | Y | 1.36 | |
16 | 34 | C | N | 2.97 | 16 | 56 | AA | Y | 1.02 | |
17 | 35 | C | Y | 1.01 | 17 | 51 | AA | N | 0.99 | |
18 | 53 | C | N | 2.07 | 18 | 36 | C | Y | 0.65 | |
19 | 38 | C | Y | 1.15 | 19 | 44 | C | N | 0.42 | |
20 | 37 | C | N | 1.07 | 20 | 35 | C | N | 2.33 | |
21 | 38 | C | Y | 1.63 | 21 | 34 | C | Y | 0.97 | |
22 | 39 | C | Y | 0.62 | ||||||
23 | 45 | C | N | 1.02 | ||||||
24 | 42 | C | N | 1.78 | ||||||
25 | 30 | C | N | 0.95 | ||||||
26 | 35 | C | Y | 1.59 |
# loading data (packages used throughout: readxl, dplyr, knitr, kableExtra)
library(readxl); library(dplyr); library(knitr); library(kableExtra)
gen.tox <- read_excel("data/table81.xlsx", sheet = "data")
data.tab <- gen.tox[c(1:3, 22:24),]
kable(data.tab, caption = "Generic Toxicology Dataset (6-observation example)") %>%
  kable_styling(bootstrap_options = "striped", full_width = F)
ID | Group | Age | Race | Smoker | DPC |
---|---|---|---|---|---|
1 | Welders | 38 | C | N | 1.77 |
2 | Welders | 44 | C | N | 1.02 |
3 | Welders | 39 | C | Y | 1.44 |
22 | Controls | 48 | AA | N | 1.08 |
23 | Controls | 63 | C | N | 1.09 |
24 | Controls | 44 | C | Y | 1.10 |
m.age.w <- round(mean(gen.tox$Age[gen.tox$Group=="Welders"]),0)
m.age.c <- round(mean(gen.tox$Age[gen.tox$Group=="Controls"]),0)
aa.p.w <- round(table(gen.tox$Race[gen.tox$Group=="Welders"])[1]/sum(table(gen.tox$Race[gen.tox$Group=="Welders"]))*100,0)
aa.p.c <- round(table(gen.tox$Race[gen.tox$Group=="Controls"])[1]/sum(table(gen.tox$Race[gen.tox$Group=="Controls"]))*100,0)
s.p.w <- round(table(gen.tox$Smoker[gen.tox$Group=="Welders"])[2]/sum(table(gen.tox$Smoker[gen.tox$Group=="Welders"]))*100,0)
s.p.c <- round(table(gen.tox$Smoker[gen.tox$Group=="Controls"])[2]/sum(table(gen.tox$Smoker[gen.tox$Group=="Controls"]))*100,0)
Welders — Controls
Mean Age | AA (%) | Smoker (%) | — | Mean Age | AA (%) | Smoker (%) |
---|---|---|---|---|---|---|
38 | 10 | 52 | — | 43 | 19 | 35 |
t.test(Age ~ Group, data=gen.tox)
Welch Two Sample t-test
data: Age by Group
t = 2.2537, df = 42.167, p-value = 0.02947
alternative hypothesis: true difference in means between group Controls and group Welders is not equal to 0
95 percent confidence interval:
0.4662145 8.4422104
sample estimates:
mean in group Controls mean in group Welders
42.69231 38.23810
tbl.age.group <- table(gen.tox$Group, gen.tox$Race)
tbl.age.group
AA C
Controls 5 21
Welders 2 19
chisq.test(tbl.age.group)
Pearson's Chi-squared test with Yates' continuity correction
data: tbl.age.group
X-squared = 0.26754, df = 1, p-value = 0.605
fisher.test(tbl.age.group, conf.int = T, conf.level = 0.95)
Fisher's Exact Test for Count Data
data: tbl.age.group
p-value = 0.4364
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.3170452 25.9844336
sample estimates:
odds ratio
2.22477
The welders are on average about 5 years younger than the controls, include relatively fewer African Americans, and include more smokers. The t-test (t = 2.25, two-sided p-value = 0.03) shows that the difference in age between welders and controls is statistically significant at the conventional 0.05 level. The two binary variables, race and smoking, are significant neither by the chi-squared test nor by Fisher's exact test for a 2×2 table.
The genetic toxicity data comprise \(L\) = 47 subjects, \(l = 1, 2, \ldots, L\).
For subject \(l\), the observed covariate, \(\textbf{x}_l\), is 3-dimensional, \(\textbf{x}_l = (x_{l1}, x_{l2}, x_{l3})^T\), where

- \(x_{l1}\) is the age,
- \(x_{l2}\) encodes race: \(x_{l2} = 1\) if \(l\) is African American, \(x_{l2} = 0\) if \(l\) is Caucasian,
- \(x_{l3}\) encodes smoking: \(x_{l3} = 1\) if \(l\) is a current smoker, \(x_{l3} = 0\) otherwise.

\(\rightarrow\) For instance, \(\textbf{x}_1 = (38, 0, 0)^T\).
gen.tox$Race <- ifelse(gen.tox$Race=="C", 0, 1)
gen.tox$Smoker <- ifelse(gen.tox$Smoker=="N", 0, 1)
gen.tox
The propensity score (PS) is the conditional probability of exposure to treatment given the observed covariates, \(e(\textbf{x}) = Pr(Z = 1 \mid \textbf{x})\). The PS is defined in terms of the observed covariates, \(\textbf{x}\), even though there is invariably concern about other covariates that were not measured.

A brief examination of the genetic toxicity data suggested that at least age, \(x_1\), and possibly race and smoking, \(x_2\) and \(x_3\), could be used to predict treatment assignment \(Z\), so the PS is not constant.
Saying that welders tend to be somewhat younger than controls is much the same as saying that the chance of being a welder is lower for someone who is older, i.e. \(e(\textbf{x})\) is lower when \(x_1\) is higher.

If two subjects have the same PS \(e(\textbf{X})\), they may still have different values of \(\textbf{X}\): subjects matched for \(e(\textbf{X})\) may be mismatched for \(\textbf{x}\). However, the mismatches in \(\textbf{x}\) will be due to chance and will tend to balance, particularly in large samples. (E.g. if young nonsmokers and old smokers have the same PS, then a match on the PS may pair a young nonsmoking welder to an old smoking control, but it will do this about as often as it pairs an old smoking welder to a young nonsmoking control.) \(\rightarrow\) matching on \(e(\textbf{X})\) tends to balance \(\textbf{X}\) \(\rightarrow\) given \(e(\textbf{X})\), \(Z\) is independent of \(\textbf{X}\).
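The indifference of the PS between covariate patterns can be sketched numerically. Below is a hypothetical logit model (the coefficients are illustrative, loosely inspired by the fitted model later in this post, not taken from it): a younger nonsmoker and an older smoker land on exactly the same PS.

```python
import math

def ps(age, smoker, b0=3.0, b1=-0.085, b2=0.55):
    """Hypothetical logit PS model: logit e(x) = b0 + b1*age + b2*smoker."""
    z = b0 + b1 * age + b2 * smoker
    return 1.0 / (1.0 + math.exp(-z))

# Equal PS requires b1*delta_age + b2 = 0, i.e. delta_age = -b2/b1 (about 6.5 years)
delta_age = -0.55 / -0.085
p_young_nonsmoker = ps(30.0, 0)
p_older_smoker = ps(30.0 + delta_age, 1)
```

A match on the PS treats these two profiles as interchangeable, which is exactly why PS-based mismatches in \(\textbf{x}\) occur in both directions and tend to balance out.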
In short:

- matching on \(e(\textbf{X})\) is often practical even when there are many covariates in \(\textbf{x}\), because \(e(\textbf{X})\) is a single variable,
- failure to balance \(e(\textbf{X})\) implies that \(\textbf{X}\) is not balanced,
- matching on \(e(\textbf{X})\) tends to balance all of \(\textbf{X}\).

On the negative side, success in balancing the observed covariates, \(\textbf{X}\), provides no assurance that unmeasured covariates are balanced. \(\rightarrow\) For example, having a father who worked as a welder is associated with a much higher chance that the son also works as a welder, so father's occupation is an unmeasured covariate that could be used to predict treatment \(Z\); success in matching on the propensity score \(e(\textbf{X})\) would tend to balance age, race and smoking, but there is no reason to expect it to balance father's occupation.
The PS is unknown, but it can be estimated from the data at hand. Here the PS is estimated by a linear logit model \[ \log {\frac{e(\textbf{X}_l)}{1 - e(\textbf{X}_l)}} = \zeta_0 + \zeta_1 x_{l1} + \zeta_2 x_{l2} + \zeta_3 x_{l3}\] and the fitted values \(\hat{e}(\textbf{X}_l)\) from this model are the estimates of the PS.
# fit a propensity score model: logistic regression
gen.tox$Group <- relevel(as.factor(gen.tox$Group), ref="Controls")
psmodel <- glm(Group ~ Age + Race + Smoker, family=binomial(), data=gen.tox)
# show coefficients etc.
summary(psmodel)
Call:
glm(formula = Group ~ Age + Race + Smoker, family = binomial(),
data = gen.tox)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.4968 -1.0216 -0.5245 1.0786 1.8186
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.06543 2.14200 1.431 0.152
Age -0.08503 0.05178 -1.642 0.101
Race -0.76900 0.98252 -0.783 0.434
Smoker 0.55095 0.65212 0.845 0.398
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 64.623 on 46 degrees of freedom
Residual deviance: 58.706 on 43 degrees of freedom
AIC: 66.706
Number of Fisher Scoring iterations: 3
# create propensity score
pscore <- psmodel$fitted.values
gen.tox$ps <- round(pscore,2)

# mean of PS in the Welders group
mean.ps.welder <- round(mean(gen.tox$ps[gen.tox$Group=="Welders"]),2)

# mean of PS among the 21 Controls with the largest PS
mean.ps.21largestcontrols <- gen.tox %>%
  filter(Group=="Controls") %>%
  top_n(21, ps) %>%        # select the top 21 controls ranked by ps
  arrange(desc(ps)) %>%    # order by decreasing ps
  summarize(round(mean(ps),2)) %>%
  as.numeric()
Estimated PS for the 21 railroad arc welders and 26 potential controls. Covariates are age, race (C=Caucasian, AA=African American), and current smoker (Y=yes, N=no).
ID | Age | Race | Smoker | DPC | \(\hat{e}(\textbf{X})\) | ID | Age | Race | Smoker | DPC | \(\hat{e}(\textbf{X})\) |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 38 | C | N | 1.77 | 0.46 | 1 | 48 | AA | N | 1.08 | 0.14 |
2 | 44 | C | N | 1.02 | 0.34 | 2 | 63 | C | N | 1.09 | 0.09 |
3 | 39 | C | Y | 1.44 | 0.57 | 3 | 44 | C | Y | 1.1 | 0.47 |
4 | 33 | AA | Y | 0.65 | 0.51 | 4 | 40 | C | N | 1.1 | 0.42 |
5 | 35 | C | Y | 2.08 | 0.65 | 5 | 50 | C | N | 0.93 | 0.23 |
6 | 39 | C | Y | 0.61 | 0.57 | 6 | 52 | C | N | 1.11 | 0.2 |
7 | 27 | C | N | 2.86 | 0.68 | 7 | 56 | C | N | 0.98 | 0.15 |
8 | 43 | C | Y | 4.19 | 0.49 | 8 | 47 | C | N | 2.2 | 0.28 |
9 | 39 | C | Y | 4.88 | 0.57 | 9 | 38 | C | N | 0.88 | 0.46 |
10 | 43 | AA | N | 1.08 | 0.2 | 10 | 34 | C | N | 1.55 | 0.46 |
11 | 41 | C | Y | 2.03 | 0.53 | 11 | 42 | C | N | 0.55 | 0.54 |
12 | 36 | C | N | 2.81 | 0.5 | 12 | 36 | C | Y | 1.04 | 0.38 |
13 | 35 | C | N | 0.94 | 0.52 | 13 | 41 | C | N | 1.66 | 0.64 |
14 | 37 | C | N | 1.43 | 0.48 | 14 | 41 | AA | Y | 1.49 | 0.4 |
15 | 39 | C | Y | 1.25 | 0.57 | 15 | 31 | AA | Y | 1.36 | 0.35 |
16 | 34 | C | N | 2.97 | 0.54 | 16 | 56 | AA | Y | 1.02 | 0.55 |
17 | 35 | C | Y | 1.01 | 0.65 | 17 | 51 | AA | N | 0.99 | 0.13 |
18 | 53 | C | N | 2.07 | 0.19 | 18 | 36 | C | Y | 0.65 | 0.12 |
19 | 38 | C | Y | 1.15 | 0.6 | 19 | 44 | C | N | 0.42 | 0.64 |
20 | 37 | C | N | 1.07 | 0.48 | 20 | 35 | C | N | 2.33 | 0.34 |
21 | 38 | C | Y | 1.63 | 0.6 | 21 | 34 | C | Y | 0.97 | 0.52 |
22 | 39 | C | Y | 0.62 | |||||||
23 | 45 | C | N | 1.02 | |||||||
24 | 42 | C | N | 1.78 | |||||||
25 | 30 | C | N | 0.95 | |||||||
26 | 35 | C | Y | 1.59 |
ps.w <- round(mean(gen.tox$ps[gen.tox$Group=="Welders"]),2)
ps.c <- round(mean(gen.tox$ps[gen.tox$Group=="Controls"]),2)
Welders — Controls
Mean Age | AA (%) | Smoker (%) | \(\hat{e}(\mathbf{x})\) | — | Mean Age | AA (%) | Smoker (%) | \(\hat{e}(\mathbf{x})\) |
---|---|---|---|---|---|---|---|---|
38 | 10 | 52 | 0.51 | — | 43 | 19 | 35 | 0.4 |
Welders #10 and #18 have similar \(\hat{e}(\textbf{x})\) but different patterns of covariates \(\textbf{x}\).

The limitations of pair matching apply with any variable, including the PS. The mean of the 21 largest \(\hat{e}(\textbf{x})\)'s in the control group is 0.46, somewhat less than the mean of \(\hat{e}(\textbf{x})\) in the treated group, namely 0.51, so no pair matching can completely close that gap.
A distance matrix is a table with one row for each treated subject and one column for each potential control. The value in row \(i\) and column \(j\) of the table is the distance between the \(i^{th}\) treated subject and the \(j^{th}\) potential control. Two individuals with the same value of \(\mathbf{x}\) would have distance zero.

For the welder data, the distance matrix has 21 rows and 26 columns (21×26). Here the distance is the squared difference in \(\hat{e}(\textbf{x})\).
ps.vec.c <- gen.tox$ps[gen.tox$Group=="Controls"]
ps.vec.w <- gen.tox$ps[gen.tox$Group=="Welders"]
d.m <- matrix(NA, nrow = 21, ncol = 26)
for (i in 1:21){
  for (j in 1:26){
    d.m[i,j] <- round((ps.vec.w[i] - ps.vec.c[j])^2, 2)
  }
}
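The same matrix of squared PS differences can be built without the double loop; a Python/NumPy sketch using the PS values from the table (welders #1–#3 and controls #1–#4):

```python
import numpy as np

ps_w = np.array([0.46, 0.34, 0.57])        # welders 1-3
ps_c = np.array([0.14, 0.09, 0.47, 0.42])  # controls 1-4

# broadcasting builds the full matrix of squared differences at once
d = np.round((ps_w[:, None] - ps_c[None, :]) ** 2, 2)
```

The entries reproduce the first rows of the table of squared differences.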
Table. Squared differences in PS between welders and controls: \((\hat{e}(\textbf{X})_{Welder} - \hat{e}(\textbf{X})_{Control})^2\). Rows are the 21 welders and columns are the first 6 of 26 potential controls.
Welder <- c(1:21)
d.m.d <- as.data.frame(d.m)
names(d.m.d) <- paste0("Control ", 1:26)
d.m.d.n <- as.data.frame(cbind(Welder, d.m.d))
kable(d.m.d.n[,1:7], caption = "Squared differences in `PS` between welders and controls") %>%
  kable_styling(bootstrap_options = "striped", full_width = F)
Welder | Control 1 | Control 2 | Control 3 | Control 4 | Control 5 | Control 6 |
---|---|---|---|---|---|---|
1 | 0.10 | 0.14 | 0.00 | 0.00 | 0.05 | 0.07 |
2 | 0.04 | 0.06 | 0.02 | 0.01 | 0.01 | 0.02 |
3 | 0.18 | 0.23 | 0.01 | 0.02 | 0.12 | 0.14 |
4 | 0.14 | 0.18 | 0.00 | 0.01 | 0.08 | 0.10 |
5 | 0.26 | 0.31 | 0.03 | 0.05 | 0.18 | 0.20 |
6 | 0.18 | 0.23 | 0.01 | 0.02 | 0.12 | 0.14 |
7 | 0.29 | 0.35 | 0.04 | 0.07 | 0.20 | 0.23 |
8 | 0.12 | 0.16 | 0.00 | 0.00 | 0.07 | 0.08 |
9 | 0.18 | 0.23 | 0.01 | 0.02 | 0.12 | 0.14 |
10 | 0.00 | 0.01 | 0.07 | 0.05 | 0.00 | 0.00 |
11 | 0.15 | 0.19 | 0.00 | 0.01 | 0.09 | 0.11 |
12 | 0.13 | 0.17 | 0.00 | 0.01 | 0.07 | 0.09 |
13 | 0.14 | 0.18 | 0.00 | 0.01 | 0.08 | 0.10 |
14 | 0.12 | 0.15 | 0.00 | 0.00 | 0.06 | 0.08 |
15 | 0.18 | 0.23 | 0.01 | 0.02 | 0.12 | 0.14 |
16 | 0.16 | 0.20 | 0.00 | 0.01 | 0.10 | 0.12 |
17 | 0.26 | 0.31 | 0.03 | 0.05 | 0.18 | 0.20 |
18 | 0.00 | 0.01 | 0.08 | 0.05 | 0.00 | 0.00 |
19 | 0.21 | 0.26 | 0.02 | 0.03 | 0.14 | 0.16 |
20 | 0.12 | 0.15 | 0.00 | 0.00 | 0.06 | 0.08 |
21 | 0.21 | 0.26 | 0.02 | 0.03 | 0.14 | 0.16 |
Look at the disadvantage: two controls with the same \(\hat{e}(\mathbf{x})\) may have different patterns of covariates, \(\mathbf{x}\), and this is ignored.

\(\Rightarrow\) In the first row, third and fourth columns of the table above, the distance is 0 to two decimal places. Actually, the distances between welder #1 and potential controls #3 and #4 are, respectively, \((0.46−0.47)^2 = 0.0001\) and \((0.46−0.42)^2 = 0.0016\), so control #3 is ever so slightly closer to welder #1. However, in terms of the details of \(\textbf{x}\), control #4 looks to be the better match: a nonsmoker with a two-year difference in age, as opposed to control #3, a smoker with a six-year difference in age. \(\rightarrow\) Because younger smokers are more common in the welder group, the PS is indifferent between a younger nonsmoker and an older smoker, but the details of \(\textbf{x}\) suggest that control #4 is a better match for welder #1 than is control #3.
An alternative distance (Rosenbaum and Rubin 1985) insists on balance for the PS, \(\hat{e}(\textbf{x})\), but once this is achieved, the details of \(\textbf{x}\) affect the distance. With a caliper of width \(w\), if two individuals, say \(k\) and \(l\), have PS values that differ by more than \(w\), i.e. \(| \hat{e}(\textbf{x}_k) - \hat{e}(\textbf{x}_l) | > w\), then the distance between them is set to \(\infty\).

sd.ps <- round(sd(pscore),3) # [1] 0.172
w <- round(sd.ps/2,3)
paste0("The standard deviation of ps: ", sd.ps)
[1] "The standard deviation of ps: 0.172"
paste0("The caliper w (half of sd of ps): ", w, ", used to demonstrate in this post.")
[1] "The caliper w (half of sd of ps): 0.086, used to demonstrate in this post."
The caliper is a multiple of the standard deviation of the PS, so that by varying the multiplier, one can vary the relative importance given to \(\hat{e}(\textbf{x})\) and \(\textbf{x}\). Here the caliper is half of the standard deviation of the PS, or 0.172/2 = 0.086. In problems of practical size, a caliper of 20% of the standard deviation of the PS is more common, and even that may be too large. A reasonable strategy: start with a caliper of 20% of the standard deviation of the PS \(\rightarrow\) adjust the caliper if needed to obtain balance on the propensity score.
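The caliper rule is easy to express directly; a Python sketch on toy PS values, with \(w = 0.086\) as above:

```python
import numpy as np

w = 0.086                            # caliper: half the SD of the PS
ps_w = np.array([0.46, 0.34])        # treated PS values (toy)
ps_c = np.array([0.47, 0.14, 0.42])  # control PS values (toy)

gap = np.abs(ps_w[:, None] - ps_c[None, :])
d = (ps_w[:, None] - ps_c[None, :]) ** 2  # base distance: squared PS difference
d[gap > w] = np.inf                       # caliper violated -> infinite distance
```

Pairs whose PS values differ by more than \(w\) can never be matched, because their distance is infinite.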
If \(\hat{\Sigma}\) is the sample covariance matrix of \(\textbf{x}\), then the estimated Mahalanobis distance between \(\textbf{x}_k\) and \(\textbf{x}_l\) is \[ (\textbf{x}_k - \textbf{x}_l)^T \hat{\Sigma}^{-1} (\textbf{x}_k - \textbf{x}_l) \]
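As a quick check of the quadratic form above, it can be evaluated with NumPy on a small toy covariate matrix laid out as (age, race, smoker); the values here are illustrative, not the full welder data:

```python
import numpy as np

X = np.array([[38., 0., 0.],
              [44., 0., 0.],
              [39., 0., 1.],
              [33., 1., 1.],
              [48., 1., 0.],
              [63., 0., 0.]])

S = np.cov(X, rowvar=False)   # sample covariance matrix (denominator n-1, as in R's cov)
S_inv = np.linalg.inv(S)

def mahal2(xk, xl):
    """Squared Mahalanobis distance (x_k - x_l)^T S^{-1} (x_k - x_l)."""
    diff = xk - xl
    return float(diff @ S_inv @ diff)
```

The distance is zero between identical covariate vectors and symmetric in its two arguments.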
var.m <- gen.tox %>%
  dplyr::select(Age, Race, Smoker) %>%
  unname() %>%
  as.matrix()
# sample variance-covariance matrix
cov.m <- cov(var.m)
round(cov.m,2)
[,1] [,2] [,3]
[1,] 54.04 0.39 -0.94
[2,] 0.39 0.13 0.02
[3,] -0.94 0.02 0.25
# inverse of sample variance-covariance matrix; inv() is from the matlib/pracma packages (base solve() is equivalent)
cov.m.inv <- inv(cov.m)
round(cov.m.inv,3)
[,1] [,2] [,3]
[1,] 0.021 -0.077 0.084
[2,] -0.077 8.127 -1.009
[3,] 0.084 -1.009 4.407
Mahalanobis distance (Rubin 1980). The distance matrix has one row for each of the 21 treated subjects (welders) and one column for each of the 26 controls; the value in row \(i\) and column \(j\) of the table is the distance between the \(i^{th}\) treated subject and the \(j^{th}\) potential control.
var.m.w <- gen.tox %>%
  filter(Group=="Welders") %>%
  dplyr::select(Age, Race, Smoker) %>%
  as.matrix()
var.m.c <- gen.tox %>%
  filter(Group=="Controls") %>%
  dplyr::select(Age, Race, Smoker) %>%
  as.matrix()
m.d.m <- matrix(NA, nrow = 21, ncol = 26)
for (i in 1:21){
  for (j in 1:26){
    m.d.m[i,j] <- t(var.m.w[i,] - var.m.c[j,]) %*% cov.m.inv %*% (var.m.w[i,] - var.m.c[j,])
  }
}
# Mahalanobis distance table
Welder <- c(1:21)
d.m.d.m <- as.data.frame(m.d.m)
names(d.m.d.m) <- paste0("Control ", 1:26)
d.d.m.d.n <- cbind(Welder, d.m.d.m)
kable(round(d.d.m.d.n[,1:7],2), caption = "Mahalanobis distance between 21 Welders and 26 Controls based on Age, Race and Smoker") %>%
  kable_styling(bootstrap_options = "striped", full_width = F)
Welder | Control 1 | Control 2 | Control 3 | Control 4 | Control 5 | Control 6 |
---|---|---|---|---|---|---|
1 | 8.65 | 12.82 | 6.15 | 0.08 | 2.95 | 4.02 |
2 | 7.84 | 7.40 | 4.41 | 0.33 | 0.74 | 1.31 |
3 | 13.33 | 12.21 | 0.51 | 4.26 | 5.05 | 5.70 |
4 | 6.51 | 28.55 | 12.29 | 11.42 | 16.20 | 17.65 |
5 | 13.85 | 15.80 | 1.66 | 4.08 | 6.51 | 7.49 |
6 | 13.33 | 12.21 | 0.51 | 4.26 | 5.05 | 5.70 |
7 | 13.95 | 26.58 | 13.18 | 3.47 | 10.85 | 12.82 |
8 | 13.46 | 9.27 | 0.02 | 5.09 | 4.24 | 4.56 |
9 | 13.33 | 12.21 | 0.51 | 4.26 | 5.05 | 5.70 |
10 | 0.51 | 19.40 | 14.89 | 7.85 | 10.20 | 11.17 |
11 | 13.31 | 10.65 | 0.18 | 4.59 | 4.56 | 5.05 |
12 | 9.24 | 14.95 | 7.06 | 0.33 | 4.02 | 5.25 |
13 | 9.60 | 16.08 | 7.57 | 0.51 | 4.61 | 5.93 |
14 | 8.92 | 13.87 | 6.58 | 0.18 | 3.47 | 4.61 |
15 | 13.33 | 12.21 | 0.51 | 4.26 | 5.05 | 5.70 |
16 | 10.00 | 17.25 | 8.13 | 0.74 | 5.25 | 6.65 |
17 | 13.85 | 15.80 | 1.66 | 4.08 | 6.51 | 7.49 |
18 | 9.41 | 2.05 | 4.56 | 3.47 | 0.18 | 0.02 |
19 | 13.40 | 13.04 | 0.74 | 4.15 | 5.35 | 6.08 |
20 | 8.92 | 13.87 | 6.58 | 0.18 | 3.47 | 4.61 |
21 | 13.40 | 13.04 | 0.74 | 4.15 | 5.35 | 6.08 |
Another way to produce the Mahalanobis distance matrix: look at the source code of the mahal function.
source("mahal.R")
print(mahal)
function (z, X)
{
X <- as.matrix(X)
n <- dim(X)[1]
rownames(X) <- 1:n
k <- dim(X)[2]
m <- sum(z)
cv <- cov(X)
out <- matrix(NA, m, n - m)
Xt <- X[z == 1, ]
Xc <- X[z == 0, ]
rownames(out) <- rownames(X)[z == 1]
colnames(out) <- rownames(X)[z == 0]
require(MASS)
icov <- ginv(cv)
for (i in 1:m) {
out[i, ] <- mahalanobis(Xc, Xt[i, ], icov, inverted = T)
}
out
}
Mahalanobis distances within PS calipers. Rows are the 21 welders and columns are the first 6 of 26 potential controls. An \(\infty\) signifies that the caliper is violated.
m.m <- matrix(NA, nrow = 21, ncol = 26)
for (i in 1:21){
  for (j in 1:26){
    m.m[i,j] <- ifelse(abs(ps.vec.w[i] - ps.vec.c[j]) > w, Inf, m.d.m[i,j])
  }
}
Welder <- c(1:21)
d.m.m <- as.data.frame(m.m)
names(d.m.m) <- paste0("Control ", 1:26)
d.d.m.m <- cbind(Welder, d.m.m)
kable(round(d.d.m.m[,1:7],2), caption = "Mahalanobis distances within PS calipers. An `Inf` signifies that the caliper is violated.") %>%
  kable_styling(bootstrap_options = "striped", full_width = F)
Welder | Control 1 | Control 2 | Control 3 | Control 4 | Control 5 | Control 6 |
---|---|---|---|---|---|---|
1 | Inf | Inf | 6.15 | 0.08 | Inf | Inf |
2 | Inf | Inf | Inf | 0.33 | Inf | Inf |
3 | Inf | Inf | Inf | Inf | Inf | Inf |
4 | Inf | Inf | 12.29 | Inf | Inf | Inf |
5 | Inf | Inf | Inf | Inf | Inf | Inf |
6 | Inf | Inf | Inf | Inf | Inf | Inf |
7 | Inf | Inf | Inf | Inf | Inf | Inf |
8 | Inf | Inf | 0.02 | 5.09 | Inf | Inf |
9 | Inf | Inf | Inf | Inf | Inf | Inf |
10 | 0.51 | Inf | Inf | Inf | 10.20 | 11.17 |
11 | Inf | Inf | 0.18 | Inf | Inf | Inf |
12 | Inf | Inf | 7.06 | 0.33 | Inf | Inf |
13 | Inf | Inf | 7.57 | Inf | Inf | Inf |
14 | Inf | Inf | 6.58 | 0.18 | Inf | Inf |
15 | Inf | Inf | Inf | Inf | Inf | Inf |
16 | Inf | Inf | 8.13 | Inf | Inf | Inf |
17 | Inf | Inf | Inf | Inf | Inf | Inf |
18 | 9.41 | Inf | Inf | Inf | 0.18 | 0.02 |
19 | Inf | Inf | Inf | Inf | Inf | Inf |
20 | Inf | Inf | 6.58 | 0.18 | Inf | Inf |
21 | Inf | Inf | Inf | Inf | Inf | Inf |
Mahalanobis distance was originally developed for use with multivariate Normal data. With data that are not Normal, Mahalanobis distance can exhibit some rather odd behavior:

- If one covariate contains extreme outliers, its standard deviation will be inflated \(\rightarrow\) Mahalanobis distance will tend to ignore that covariate in matching.
- Mahalanobis distance gives greater weight to binary variables with probabilities near 0 or 1 than to binary variables with probabilities closer to 1/2.
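The weighting claim has a one-line arithmetic check: ignoring covariances, the Mahalanobis distance weights a mismatch on a binary covariate by roughly the reciprocal of its variance, \(1/(p(1-p))\), which blows up as \(p\) approaches 0 or 1:

```python
def mismatch_weight(p):
    """Approximate Mahalanobis weight of a binary mismatch: 1/variance = 1/(p(1-p))."""
    return 1.0 / (p * (1.0 - p))

w_rare = mismatch_weight(0.05)  # rare binary covariate
w_even = mismatch_weight(0.50)  # balanced binary covariate
```

A covariate present in 5% of subjects gets roughly five times the weight of one present in 50%, whether or not that importance is warranted.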
In the welder data:

- two subjects with the same Age and Race but different smoking behavior have a Mahalanobis distance of 4.41,
- two subjects with the same Age and Smoking behavior but different Race have a Mahalanobis distance of 8.06,
- \(\Longrightarrow\) a mismatch on Race is counted as almost twice as bad as a mismatch on Smoking,
- two subjects who differ by 20 years in Age, with the same Race and Smoking, would have a Mahalanobis distance of 0.021 × \(20^2\) = 8.204232 (using the unrounded entry of \(\hat{\Sigma}^{-1}\)),
- \(\Longrightarrow\) a mismatch on Race counts about as much as a 20-year difference in Age.

\(\Longrightarrow\) In many contexts, rare binary covariates are not of overriding importance, and outliers do not make a covariate unimportant \(\rightarrow\) Mahalanobis distance may not be appropriate with covariates of this kind.
rank-based Mahalanobis distance
### Note that the covariance matrix now is computed on the rank-transformed matrix
data.r <- gen.tox %>%
  dplyr::select(Age, Race, Smoker) %>%
  as.matrix()
for (j in 1:dim(data.r)[2]){data.r[,j] <- rank(data.r[,j])}
cov.m.r <- cov(data.r)
## Step 1: split the ranked covariates by group
var.m.w.r <- data.r[gen.tox$Group=="Welders",]
var.m.c.r <- data.r[gen.tox$Group=="Controls",]

## Step 2: rescale the covariance matrix to the variance of untied ranks
var.untied <- var(1:dim(gen.tox)[1])
rat <- sqrt(var.untied/diag(cov.m.r))
cov.m.r.adjusted <- diag(rat) %*% cov.m.r %*% diag(rat)
cov.m.r.adjusted.inv <- inv(cov.m.r.adjusted)
m.d.m.r <- matrix(NA, nrow = 21, ncol = 26)
for (i in 1:21){
  for (j in 1:26){
    m.d.m.r[i,j] <- t(var.m.w.r[i,] - var.m.c.r[j,]) %*% cov.m.r.adjusted.inv %*% (var.m.w.r[i,] - var.m.c.r[j,])
  }
}
# rank-based Mahalanobis distance table
Welder <- c(1:21)
d.m.d.m.r <- as.data.frame(m.d.m.r)
names(d.m.d.m.r) <- paste0("Control ", 1:26)
d.d.m.d.m.r <- cbind(Welder, d.m.d.m.r)
kable(round(d.d.m.d.m.r[,1:7],2), caption = "Rank-based Mahalanobis distance between 21 Welders and 26 Controls based on Age, Race and Smoker") %>%
  kable_styling(bootstrap_options = "striped", full_width = F)
Welder | Control 1 | Control 2 | Control 3 | Control 4 | Control 5 | Control 6 |
---|---|---|---|---|---|---|
1 | 4.61 | 4.40 | 5.98 | 0.33 | 2.69 | 3.21 |
2 | 2.98 | 0.70 | 3.21 | 0.47 | 0.15 | 0.28 |
3 | 6.91 | 4.62 | 0.84 | 3.04 | 3.66 | 3.93 |
4 | 8.14 | 14.80 | 10.43 | 7.69 | 12.17 | 13.00 |
5 | 9.03 | 8.49 | 3.93 | 3.66 | 6.55 | 7.15 |
6 | 6.91 | 4.62 | 0.84 | 3.04 | 3.66 | 3.93 |
7 | 10.20 | 12.31 | 12.86 | 3.93 | 9.30 | 10.26 |
8 | 6.77 | 3.29 | 0.04 | 3.92 | 2.99 | 3.05 |
9 | 6.91 | 4.62 | 0.84 | 3.04 | 3.66 | 3.93 |
10 | 0.25 | 4.72 | 7.61 | 3.03 | 3.72 | 4.01 |
11 | 6.71 | 3.79 | 0.28 | 3.38 | 3.18 | 3.34 |
12 | 5.86 | 6.33 | 7.61 | 0.98 | 4.24 | 4.89 |
13 | 6.98 | 7.96 | 9.02 | 1.68 | 5.59 | 6.33 |
14 | 5.25 | 5.41 | 6.83 | 0.64 | 3.49 | 4.08 |
15 | 6.91 | 4.62 | 0.84 | 3.04 | 3.66 | 3.93 |
16 | 8.30 | 9.78 | 10.61 | 2.56 | 7.12 | 7.96 |
17 | 9.03 | 8.49 | 3.93 | 3.66 | 6.55 | 7.15 |
18 | 3.33 | 0.05 | 3.00 | 1.68 | 0.05 | 0.01 |
19 | 7.34 | 5.62 | 1.58 | 2.99 | 4.34 | 4.72 |
20 | 5.25 | 5.41 | 6.83 | 0.64 | 3.49 | 4.08 |
21 | 7.34 | 5.62 | 1.58 | 2.99 | 4.34 | 4.72 |
Another way to produce the rank-based Mahalanobis distance matrix: look at the source code of the smahal function.
source("smahal.R")
print(smahal)
function (z, X)
{
X <- as.matrix(X)
n <- dim(X)[1]
rownames(X) <- 1:n
k <- dim(X)[2]
m <- sum(z)
for (j in 1:k) {
X[, j] <- rank(X[, j])
}
cv <- cov(X)
vuntied <- var(1:n)
rat <- sqrt(vuntied/diag(cv))
cv <- diag(rat) %*% cv %*% diag(rat)
out <- matrix(NA, m, n - m)
Xc <- X[z == 0, ]
Xt <- X[z == 1, ]
rownames(out) <- rownames(X)[z == 1]
colnames(out) <- rownames(X)[z == 0]
library(MASS)
icov <- ginv(cv)
for (i in 1:m) {
out[i, ] <- mahalanobis(Xc, Xt[i, ], icov, inverted = T)
}
out
}
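The same construction as smahal can be sketched in Python (a sketch assuming NumPy and SciPy are available; scipy.stats.rankdata uses average ranks for ties, matching R's rank). The small covariate matrix at the end is a toy example in the (age, race, smoker) layout, not the full dataset:

```python
import numpy as np
from scipy.stats import rankdata

def rank_mahal(X, k, l):
    """Rank-based squared Mahalanobis distance between rows k and l of X,
    following the smahal recipe: rank each column, rescale the covariance
    to the untied-rank variance, then use a generalized inverse."""
    R = np.apply_along_axis(rankdata, 0, np.asarray(X, dtype=float))
    cv = np.cov(R, rowvar=False)
    n = R.shape[0]
    vuntied = np.var(np.arange(1, n + 1), ddof=1)  # variance of untied ranks
    rat = np.sqrt(vuntied / np.diag(cv))           # adjustment for ties
    cv = np.diag(rat) @ cv @ np.diag(rat)
    icov = np.linalg.pinv(cv)                      # like MASS::ginv
    diff = R[k] - R[l]
    return float(diff @ icov @ diff)

X = [[38, 0, 0], [44, 0, 0], [39, 0, 1], [33, 1, 1], [48, 1, 0], [63, 0, 0]]
```

Because ranks cap the spread of each column, outliers no longer inflate a covariate's variance, and binary covariates receive comparable weight regardless of their prevalence.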
In the welder data:

- two subjects with the same Age and Race but different smoking behavior have a rank-based Mahalanobis distance of 3.21 (4.41 for the plain Mahalanobis distance),
- two subjects with the same Age and Smoking behavior but different Race have a rank-based Mahalanobis distance of 3.07 (8.06 for the plain Mahalanobis distance),
- \(\Longrightarrow\) a mismatch on race is now counted as about equal to a mismatch on smoking.
Rank-based Mahalanobis distances within PS calipers. An Inf signifies that the caliper is violated.
m.m.r <- matrix(NA, nrow = 21, ncol = 26)
for (i in 1:21){
  for (j in 1:26){
    m.m.r[i,j] <- ifelse(abs(ps.vec.w[i] - ps.vec.c[j]) > w, Inf, m.d.m.r[i,j])
  }
}
Welder <- c(1:21)
d.m.m.r <- as.data.frame(m.m.r)
names(d.m.m.r) <- paste0("Control ", 1:26)
d.d.m.m.r <- cbind(Welder, d.m.m.r)
kable(round(d.d.m.m.r[,1:7],2), caption = "Rank-based Mahalanobis distances within PS calipers. An `Inf` signifies that the caliper is violated.") %>%
  kable_styling(bootstrap_options = "striped", full_width = F)
Welder | Control 1 | Control 2 | Control 3 | Control 4 | Control 5 | Control 6 |
---|---|---|---|---|---|---|
1 | Inf | Inf | 5.98 | 0.33 | Inf | Inf |
2 | Inf | Inf | Inf | 0.47 | Inf | Inf |
3 | Inf | Inf | Inf | Inf | Inf | Inf |
4 | Inf | Inf | 10.43 | Inf | Inf | Inf |
5 | Inf | Inf | Inf | Inf | Inf | Inf |
6 | Inf | Inf | Inf | Inf | Inf | Inf |
7 | Inf | Inf | Inf | Inf | Inf | Inf |
8 | Inf | Inf | 0.04 | 3.92 | Inf | Inf |
9 | Inf | Inf | Inf | Inf | Inf | Inf |
10 | 0.25 | Inf | Inf | Inf | 3.72 | 4.01 |
11 | Inf | Inf | 0.28 | Inf | Inf | Inf |
12 | Inf | Inf | 7.61 | 0.98 | Inf | Inf |
13 | Inf | Inf | 9.02 | Inf | Inf | Inf |
14 | Inf | Inf | 6.83 | 0.64 | Inf | Inf |
15 | Inf | Inf | Inf | Inf | Inf | Inf |
16 | Inf | Inf | 10.61 | Inf | Inf | Inf |
17 | Inf | Inf | Inf | Inf | Inf | Inf |
18 | 3.33 | Inf | Inf | Inf | 0.05 | 0.01 |
19 | Inf | Inf | Inf | Inf | Inf | Inf |
20 | Inf | Inf | 6.83 | 0.64 | Inf | Inf |
21 | Inf | Inf | Inf | Inf | Inf | Inf |
Optimal pair matching pairs each treated subject with a different control to minimize the total distance within matched pairs. Hansen's pairmatch function in his optmatch package makes Bertsekas' Fortran code for the 'assignment problem' (solved by Kuhn in 1955) available from inside the statistical package R; the SAS procedure proc assign also solves the assignment problem.
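In Python, the assignment problem behind optimal pair matching can be solved with scipy.optimize.linear_sum_assignment (a Hungarian-style solver). A toy sketch on the first three rows and four columns of the squared-PS-difference table from earlier:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

d = np.array([[0.10, 0.14, 0.00, 0.00],   # welder 1 vs controls 1-4
              [0.04, 0.06, 0.02, 0.01],   # welder 2
              [0.18, 0.23, 0.01, 0.02]])  # welder 3

rows, cols = linear_sum_assignment(d)  # each welder gets a distinct control
total = d[rows, cols].sum()
```

With these numbers the optimal total is 0.05, while a row-by-row greedy match (always take the closest remaining control) ends at 0.19 — a small illustration of why optimal matching is preferred over greedy matching.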
# pair each welder with the closest available control by squared PS distance
ctrl <- NULL
k <- NULL
d <- d.m.d.n[-1]
for (i in 1:21){
  k <- which.min(d[i,])
  ctrl[i] <- names(k)
  d <- d[-k]     # remove the chosen control from further consideration
}
index <- base::order(ctrl)
Welders <- gen.tox %>%
  filter(Group=="Welders") %>%
  dplyr::select(Age, Race, Smoker, ps)
Welders$Pair <- paste0("Pair ",1:21)
Welders <- Welders[,c("Pair","Age","Race","Smoker","ps")]

Controls <- gen.tox %>%
  filter(Group=="Controls") %>%
  dplyr::select(Age, Race, Smoker, ps)
Controls$id <- paste0("Control ",1:26)

matched.controls <- Controls[which(Controls$id %in% ctrl),]

# Reorder to follow the match order in ctrl; reorder.factor() is from the gdata package
matched.controls$id <- reorder.factor(matched.controls$id, new.order=ctrl)
#matched.controls <- matched.controls[-5]
matched.controls <- matched.controls %>% dplyr::arrange(id)

kable(cbind(Welders,matched.controls), caption="Pair match using the closest control in the PS") %>%
  kable_styling(bootstrap_options = "striped", full_width = F)
Pair | Age | Race | Smoker | ps | Age | Race | Smoker | ps | id |
---|---|---|---|---|---|---|---|---|---|
Pair 1 | 38 | 0 | 0 | 0.46 | 44 | 0 | 1 | 0.47 | Control 3 |
Pair 2 | 44 | 0 | 0 | 0.34 | 47 | 0 | 0 | 0.28 | Control 8 |
Pair 3 | 39 | 0 | 1 | 0.57 | 34 | 0 | 0 | 0.54 | Control 10 |
Pair 4 | 33 | 1 | 1 | 0.51 | 38 | 0 | 0 | 0.46 | Control 9 |
Pair 5 | 35 | 0 | 1 | 0.65 | 36 | 0 | 1 | 0.64 | Control 12 |
Pair 6 | 39 | 0 | 1 | 0.57 | 31 | 1 | 1 | 0.55 | Control 15 |
Pair 7 | 27 | 0 | 0 | 0.68 | 36 | 0 | 1 | 0.64 | Control 18 |
Pair 8 | 43 | 0 | 1 | 0.49 | 40 | 0 | 0 | 0.42 | Control 4 |
Pair 9 | 39 | 0 | 1 | 0.57 | 35 | 0 | 0 | 0.52 | Control 20 |
Pair 10 | 43 | 1 | 0 | 0.20 | 48 | 1 | 0 | 0.14 | Control 1 |
Pair 11 | 41 | 0 | 1 | 0.53 | 39 | 0 | 1 | 0.57 | Control 22 |
Pair 12 | 36 | 0 | 0 | 0.50 | 42 | 0 | 0 | 0.38 | Control 11 |
Pair 13 | 35 | 0 | 0 | 0.52 | 41 | 0 | 0 | 0.40 | Control 13 |
Pair 14 | 37 | 0 | 0 | 0.48 | 42 | 0 | 0 | 0.38 | Control 24 |
Pair 15 | 39 | 0 | 1 | 0.57 | 30 | 0 | 0 | 0.63 | Control 25 |
Pair 16 | 34 | 0 | 0 | 0.54 | 35 | 0 | 1 | 0.65 | Control 26 |
Pair 17 | 35 | 0 | 1 | 0.65 | 34 | 0 | 1 | 0.67 | Control 21 |
Pair 18 | 53 | 0 | 0 | 0.19 | 50 | 0 | 0 | 0.23 | Control 5 |
Pair 19 | 38 | 0 | 1 | 0.60 | 41 | 1 | 1 | 0.35 | Control 14 |
Pair 20 | 37 | 0 | 0 | 0.48 | 44 | 0 | 0 | 0.34 | Control 19 |
Pair 21 | 38 | 0 | 1 | 0.60 | 45 | 0 | 0 | 0.32 | Control 23 |
m.age.mps <- round(mean(matched.controls$Age),0)
aa.p.mps <- round(table(matched.controls$Race)[2]/21*100,0)
s.p.mps <- round(table(matched.controls$Smoker)[2]/21*100,0)
ps.mps <- round(mean(matched.controls$ps),2)
Welders — Matched Controls
Mean Age | AA (%) | Smoker (%) | \(\hat{e}(\mathbf{x})\) | — | Mean Age | AA (%) | Smoker (%) | \(\hat{e}(\mathbf{x})\) |
---|---|---|---|---|---|---|---|---|
38 | 10 | 52 | 0.51 | — | 40 | 14 | 38 | 0.46 |
\(\Longrightarrow\) the marginal means after the matching above are well balanced, as expected.
Next, I use the pairmatch function in the optmatch package for comparison.
gen.tox.op <- gen.tox
gen.tox.op$Group1 <- as.numeric(ifelse(gen.tox.op$Group=="Welders", 1, 0))
#pairmatch(match_on( Group~ps, data=gen.tox ) ) # did not run
pm <- pairmatch(Group1 ~ ps, data = gen.tox.op)
print(pairmatch(Group1 ~ ps, data = gen.tox.op), grouped = TRUE) # show the matched ids of treated and controls
Group Members
1.1 1, 35
1.10 18, 27
1.11 19, 46
1.12 2, 29
1.13 20, 34
1.14 21, 39
1.15 3, 41
1.16 4, 30
1.17 5, 47
1.18 6, 43
1.19 7, 42
1.2 10, 26
1.20 8, 45
1.21 9, 31
1.3 11, 25
1.4 12, 40
1.5 13, 44
1.6 14, 24
1.7 15, 36
1.8 16, 32
1.9 17, 33
summary(pairmatch(Group1 ~ ps, data = gen.tox.op))
Structure of matched sets:
1:1 0:1
21 5
Effective Sample Size: 21
(equivalent number of matched pairs).
#all.equal(names(pm), row.names(gen.tox))
## Housekeeping
ipm <- as.integer(pm)
gen.tox.op <- cbind(gen.tox.op, matches=pm, ipm)
data.pairmatch <- gen.tox.op[matched(pm),] # keep only the matched cases

Welders <- data.pairmatch %>%
  filter(Group=="Welders") %>%
  arrange(ipm) %>%
  dplyr::select(Age, Race, Smoker, ps, ipm)
Welders$Pair <- paste0("Pair ",1:21)
Welders <- Welders[,c("Pair","Age","Race","Smoker","ps","ipm")]

matched.controls <- data.pairmatch %>%
  filter(Group=="Controls") %>%
  arrange(ipm) %>%
  dplyr::select(Age, Race, Smoker, ps, ipm)
kable(cbind(Welders,matched.controls), caption="Optimal pair match using the squared difference in the PS") %>%
kable_styling(bootstrap_options = "striped", full_width = F)
Pair | Age | Race | Smoker | ps | ipm | Age | Race | Smoker | ps | ipm | |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | Pair 1 | 38 | 0 | 0 | 0.46 | 1 | 41 | 1 | 1 | 0.35 | 1 |
18 | Pair 2 | 53 | 0 | 0 | 0.19 | 2 | 52 | 0 | 0 | 0.20 | 2 |
19 | Pair 3 | 38 | 0 | 1 | 0.60 | 3 | 30 | 0 | 0 | 0.63 | 3 |
2 | Pair 4 | 44 | 0 | 0 | 0.34 | 4 | 47 | 0 | 0 | 0.28 | 4 |
20 | Pair 5 | 37 | 0 | 0 | 0.48 | 5 | 41 | 0 | 0 | 0.40 | 5 |
21 | Pair 6 | 38 | 0 | 1 | 0.60 | 6 | 36 | 0 | 1 | 0.64 | 6 |
3 | Pair 7 | 39 | 0 | 1 | 0.57 | 7 | 35 | 0 | 0 | 0.52 | 7 |
4 | Pair 8 | 33 | 1 | 1 | 0.51 | 8 | 38 | 0 | 0 | 0.46 | 8 |
5 | Pair 9 | 35 | 0 | 1 | 0.65 | 9 | 35 | 0 | 1 | 0.65 | 9 |
6 | Pair 10 | 39 | 0 | 1 | 0.57 | 10 | 39 | 0 | 1 | 0.57 | 10 |
7 | Pair 11 | 27 | 0 | 0 | 0.68 | 11 | 34 | 0 | 1 | 0.67 | 11 |
10 | Pair 12 | 43 | 1 | 0 | 0.20 | 12 | 50 | 0 | 0 | 0.23 | 12 |
8 | Pair 13 | 43 | 0 | 1 | 0.49 | 13 | 42 | 0 | 0 | 0.38 | 13 |
9 | Pair 14 | 39 | 0 | 1 | 0.57 | 14 | 34 | 0 | 0 | 0.54 | 14 |
11 | Pair 15 | 41 | 0 | 1 | 0.53 | 15 | 40 | 0 | 0 | 0.42 | 15 |
12 | Pair 16 | 36 | 0 | 0 | 0.50 | 16 | 44 | 0 | 0 | 0.34 | 16 |
13 | Pair 17 | 35 | 0 | 0 | 0.52 | 17 | 45 | 0 | 0 | 0.32 | 17 |
14 | Pair 18 | 37 | 0 | 0 | 0.48 | 18 | 44 | 0 | 1 | 0.47 | 18 |
15 | Pair 19 | 39 | 0 | 1 | 0.57 | 19 | 31 | 1 | 1 | 0.55 | 19 |
16 | Pair 20 | 34 | 0 | 0 | 0.54 | 20 | 42 | 0 | 0 | 0.38 | 20 |
17 | Pair 21 | 35 | 0 | 1 | 0.65 | 21 | 36 | 0 | 1 | 0.64 | 21 |
```r
m.age.mps <- round(mean(matched.controls$Age), 0)
aa.p.mps <- round(table(matched.controls$Race)[2]/21*100, 0)
s.p.mps <- round(table(matched.controls$Smoker)[2]/21*100, 0)
ps.mps <- round(mean(matched.controls$ps), 2)
```
Welders — Matched Controls
Mean Age | AA (%) | Smoker (%) | \(\hat{e}(\mathbf{x})\) | — | Mean Age | AA (%) | Smoker (%) | \(\hat{e}(\mathbf{x})\) |
---|---|---|---|---|---|---|---|---|
38 | 10 | 52 | 0.51 | — | 40 | 10 | 38 | 0.46 |
\(\Longrightarrow\) the marginal means after the matching above are fairly well balanced, but the individual pairs are often not matched for race or smoking.
Mahalanobis distance within PS calipers

The caliper is half the standard deviation of the PS, or 0.172/2 = 0.086.
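Conceptually, the `mahal` and `addcaliper` steps below do two things: measure covariate distance on a scale that accounts for spread and correlation, and make pairs that violate the PS caliper prohibitively expensive. Here is a stdlib-only Python sketch with hypothetical data (a flat additive penalty is used here; actual implementations may instead scale the penalty with the size of the violation):

```python
# Mahalanobis distance between (age, smoker) covariate vectors, plus a
# penalty when the propensity scores differ by more than the caliper.
# All numbers below are hypothetical, for illustration only.
treated  = [(38, 0), (53, 0), (35, 1)]   # (age, smoker) for 3 treated units
controls = [(41, 1), (52, 0), (36, 1)]
ps_t = [0.46, 0.19, 0.65]                # propensity scores
ps_c = [0.35, 0.20, 0.64]

everyone = treated + controls

def cov2(data):
    """2x2 sample covariance of (age, smoker), pooled over all subjects."""
    n = len(data)
    ma = sum(a for a, s in data) / n
    ms = sum(s for a, s in data) / n
    saa = sum((a - ma) ** 2 for a, s in data) / (n - 1)
    sss = sum((s - ms) ** 2 for a, s in data) / (n - 1)
    sas = sum((a - ma) * (s - ms) for a, s in data) / (n - 1)
    return saa, sas, sss

saa, sas, sss = cov2(everyone)
det = saa * sss - sas * sas
inv = (sss / det, -sas / det, saa / det)  # inverse of the 2x2 covariance

def mahal2(u, v):
    """Squared Mahalanobis distance between covariate vectors u and v."""
    dx, dy = u[0] - v[0], u[1] - v[1]
    a, b, c = inv
    return a * dx * dx + 2 * b * dx * dy + c * dy * dy

CALIPER = 0.086  # half the SD of the PS, as in the text
PENALTY = 1000   # large additive penalty for caliper violations

dmat = [[mahal2(t, c) + (PENALTY if abs(pt - pc) > CALIPER else 0)
         for c, pc in zip(controls, ps_c)]
        for t, pt in zip(treated, ps_t)]

for row in dmat:
    print(["%8.2f" % d for d in row])
```

Pairs within the caliper keep their ordinary Mahalanobis distance; pairs outside it become so expensive that the optimal match avoids them whenever a feasible alternative exists.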
```r
z <- 1*(gen.tox$Group=="Welders")
aa <- gen.tox$Race
smoker <- gen.tox$Smoker
age <- gen.tox$Age
x <- cbind(age, aa, smoker)

# Mahalanobis distances
dmat <- mahal(z, x)

# Impose propensity score calipers
prop <- gen.tox$ps # propensity score
# Mahalanobis distance penalized for violations of a propensity score caliper.
# This version is used for numerical work.
dmat <- addcaliper(dmat, z, prop, caliper=.5)

# Find the minimum distance match within propensity score calipers.
pm.cal <- optmatch::pairmatch(dmat, data=gen.tox) #, remove.unmatchables = TRUE

# Housekeeping
ipm.cal <- as.integer(pm.cal)
gen.tox.op.caliper <- cbind(gen.tox, matches=pm.cal, ipm.cal)
data.pairmatch.cal <- gen.tox.op.caliper[matched(pm.cal),] # only select matched cases

Welders <- data.pairmatch.cal %>%
  filter(Group=="Welders") %>%
  arrange(ipm.cal) %>%
  dplyr::select(Age, Race, Smoker, ps, ipm.cal)
Welders$Pair <- paste0("Pair ", 1:21)
Welders <- Welders[, c("Pair","Age","Race","Smoker","ps","ipm.cal")]

matched.controls <- data.pairmatch.cal %>%
  filter(Group=="Controls") %>%
  arrange(ipm.cal) %>%
  dplyr::select(Age, Race, Smoker, ps, ipm.cal)

kable(cbind(Welders, matched.controls), caption="Optimal pair match using the Mahalanobis distance within PS calipers") %>%
  kable_styling(bootstrap_options = "striped", full_width = F)
```
ID | Pair | Age | Race | Smoker | ps | ipm.cal | Age | Race | Smoker | ps | ipm.cal |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | Pair 1 | 38 | 0 | 0 | 0.46 | 1 | 44 | 0 | 0 | 0.34 | 1 |
18 | Pair 2 | 53 | 0 | 0 | 0.19 | 2 | 52 | 0 | 0 | 0.20 | 2 |
19 | Pair 3 | 38 | 0 | 1 | 0.60 | 3 | 34 | 0 | 0 | 0.54 | 3 |
2 | Pair 4 | 44 | 0 | 0 | 0.34 | 4 | 47 | 0 | 0 | 0.28 | 4 |
20 | Pair 5 | 37 | 0 | 0 | 0.48 | 5 | 42 | 0 | 0 | 0.38 | 5 |
21 | Pair 6 | 38 | 0 | 1 | 0.60 | 6 | 31 | 1 | 1 | 0.55 | 6 |
3 | Pair 7 | 39 | 0 | 1 | 0.57 | 7 | 36 | 0 | 1 | 0.64 | 7 |
4 | Pair 8 | 33 | 1 | 1 | 0.51 | 8 | 41 | 1 | 1 | 0.35 | 8 |
5 | Pair 9 | 35 | 0 | 1 | 0.65 | 9 | 35 | 0 | 1 | 0.65 | 9 |
6 | Pair 10 | 39 | 0 | 1 | 0.57 | 10 | 35 | 0 | 0 | 0.52 | 10 |
7 | Pair 11 | 27 | 0 | 0 | 0.68 | 11 | 30 | 0 | 0 | 0.63 | 11 |
10 | Pair 12 | 43 | 1 | 0 | 0.20 | 12 | 48 | 1 | 0 | 0.14 | 12 |
8 | Pair 13 | 43 | 0 | 1 | 0.49 | 13 | 45 | 0 | 0 | 0.32 | 13 |
9 | Pair 14 | 39 | 0 | 1 | 0.57 | 14 | 36 | 0 | 1 | 0.64 | 14 |
11 | Pair 15 | 41 | 0 | 1 | 0.53 | 15 | 44 | 0 | 1 | 0.47 | 15 |
12 | Pair 16 | 36 | 0 | 0 | 0.50 | 16 | 41 | 0 | 0 | 0.40 | 16 |
13 | Pair 17 | 35 | 0 | 0 | 0.52 | 17 | 40 | 0 | 0 | 0.42 | 17 |
14 | Pair 18 | 37 | 0 | 0 | 0.48 | 18 | 42 | 0 | 0 | 0.38 | 18 |
15 | Pair 19 | 39 | 0 | 1 | 0.57 | 19 | 39 | 0 | 1 | 0.57 | 19 |
16 | Pair 20 | 34 | 0 | 0 | 0.54 | 20 | 38 | 0 | 0 | 0.46 | 20 |
17 | Pair 21 | 35 | 0 | 1 | 0.65 | 21 | 34 | 0 | 1 | 0.67 | 21 |
```r
m.age.mps <- round(mean(matched.controls$Age), 0)
aa.p.mps <- round(table(matched.controls$Race)[2]/21*100, 0)
s.p.mps <- round(table(matched.controls$Smoker)[2]/21*100, 0)
ps.mps <- round(mean(matched.controls$ps), 2)
```
Welders — Matched Controls
Mean Age | AA (%) | Smoker (%) | \(\hat{e}(\mathbf{x})\) | — | Mean Age | AA (%) | Smoker (%) | \(\hat{e}(\mathbf{x})\) |
---|---|---|---|---|---|---|---|---|
38 | 10 | 52 | 0.51 | — | 40 | 14 | 38 | 0.45 |
Rank-based Mahalanobis distance within PS calipers

Replace `mahal` by `smahal`:
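The rank-based variant replaces each covariate column by its ranks, with tied values sharing their average rank, before the Mahalanobis distance is computed; this limits the influence of outliers and heavy-tailed covariates. Here is a stdlib-only Python sketch of the rank transform alone (hypothetical ages; `smahal` additionally rescales the ranked columns, which is omitted here):

```python
# Average ranks for one covariate column, as used by a rank-based
# Mahalanobis distance. Ages are hypothetical; note the outlier 90,
# which the rank transform pulls in to merely "largest".
def ranks(column):
    """Ranks 1..n, with tied values sharing their average rank."""
    order = sorted(range(len(column)), key=lambda i: column[i])
    r = [0.0] * len(column)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and column[order[j + 1]] == column[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of positions i+1 .. j+1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

print(ranks([38, 53, 35, 41, 52, 36, 90]))  # the outlier 90 simply gets the top rank
print(ranks([0, 0, 1, 1, 0, 1]))            # ties share their average rank
```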
```r
z <- 1*(gen.tox$Group=="Welders")
aa <- gen.tox$Race
smoker <- gen.tox$Smoker
age <- gen.tox$Age
x <- cbind(age, aa, smoker)

# Rank-based Mahalanobis distances
dmat <- smahal(z, x)

# Impose propensity score calipers
prop <- gen.tox$ps # propensity score
# Mahalanobis distance penalized for violations of a propensity score caliper.
# This version is used for numerical work.
dmat <- addcaliper(dmat, z, prop, caliper=.5)

# Find the minimum distance match within propensity score calipers.
pm.cal <- optmatch::pairmatch(dmat, data=gen.tox) #, remove.unmatchables = TRUE

# Housekeeping
ipm.cal <- as.integer(pm.cal)
gen.tox.op.caliper <- cbind(gen.tox, matches=pm.cal, ipm.cal)
data.pairmatch.cal <- gen.tox.op.caliper[matched(pm.cal),] # only select matched cases

Welders <- data.pairmatch.cal %>%
  filter(Group=="Welders") %>%
  arrange(ipm.cal) %>%
  dplyr::select(Age, Race, Smoker, ps, ipm.cal)
Welders$Pair <- paste0("Pair ", 1:21)
Welders <- Welders[, c("Pair","Age","Race","Smoker","ps","ipm.cal")]

matched.controls <- data.pairmatch.cal %>%
  filter(Group=="Controls") %>%
  arrange(ipm.cal) %>%
  dplyr::select(Age, Race, Smoker, ps, ipm.cal)

kable(cbind(Welders, matched.controls), caption="Optimal pair match using the rank-based Mahalanobis distance within PS calipers") %>%
  kable_styling(bootstrap_options = "striped", full_width = F)
```
ID | Pair | Age | Race | Smoker | ps | ipm.cal | Age | Race | Smoker | ps | ipm.cal |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | Pair 1 | 38 | 0 | 0 | 0.46 | 1 | 44 | 0 | 0 | 0.34 | 1 |
18 | Pair 2 | 53 | 0 | 0 | 0.19 | 2 | 52 | 0 | 0 | 0.20 | 2 |
19 | Pair 3 | 38 | 0 | 1 | 0.60 | 3 | 31 | 1 | 1 | 0.55 | 3 |
2 | Pair 4 | 44 | 0 | 0 | 0.34 | 4 | 47 | 0 | 0 | 0.28 | 4 |
20 | Pair 5 | 37 | 0 | 0 | 0.48 | 5 | 42 | 0 | 0 | 0.38 | 5 |
21 | Pair 6 | 38 | 0 | 1 | 0.60 | 6 | 34 | 0 | 0 | 0.54 | 6 |
3 | Pair 7 | 39 | 0 | 1 | 0.57 | 7 | 35 | 0 | 0 | 0.52 | 7 |
4 | Pair 8 | 33 | 1 | 1 | 0.51 | 8 | 41 | 1 | 1 | 0.35 | 8 |
5 | Pair 9 | 35 | 0 | 1 | 0.65 | 9 | 35 | 0 | 1 | 0.65 | 9 |
6 | Pair 10 | 39 | 0 | 1 | 0.57 | 10 | 36 | 0 | 1 | 0.64 | 10 |
7 | Pair 11 | 27 | 0 | 0 | 0.68 | 11 | 30 | 0 | 0 | 0.63 | 11 |
10 | Pair 12 | 43 | 1 | 0 | 0.20 | 12 | 48 | 1 | 0 | 0.14 | 12 |
8 | Pair 13 | 43 | 0 | 1 | 0.49 | 13 | 45 | 0 | 0 | 0.32 | 13 |
9 | Pair 14 | 39 | 0 | 1 | 0.57 | 14 | 36 | 0 | 1 | 0.64 | 14 |
11 | Pair 15 | 41 | 0 | 1 | 0.53 | 15 | 44 | 0 | 1 | 0.47 | 15 |
12 | Pair 16 | 36 | 0 | 0 | 0.50 | 16 | 41 | 0 | 0 | 0.40 | 16 |
13 | Pair 17 | 35 | 0 | 0 | 0.52 | 17 | 40 | 0 | 0 | 0.42 | 17 |
14 | Pair 18 | 37 | 0 | 0 | 0.48 | 18 | 42 | 0 | 0 | 0.38 | 18 |
15 | Pair 19 | 39 | 0 | 1 | 0.57 | 19 | 39 | 0 | 1 | 0.57 | 19 |
16 | Pair 20 | 34 | 0 | 0 | 0.54 | 20 | 38 | 0 | 0 | 0.46 | 20 |
17 | Pair 21 | 35 | 0 | 1 | 0.65 | 21 | 34 | 0 | 1 | 0.67 | 21 |
```r
m.age.mps <- round(mean(matched.controls$Age), 0)
aa.p.mps <- round(table(matched.controls$Race)[2]/21*100, 0)
s.p.mps <- round(table(matched.controls$Smoker)[2]/21*100, 0)
ps.mps <- round(mean(matched.controls$ps), 2)
```
Welders — Matched Controls
Mean Age | AA (%) | Smoker (%) | \(\hat{e}(\mathbf{x})\) | — | Mean Age | AA (%) | Smoker (%) | \(\hat{e}(\mathbf{x})\) |
---|---|---|---|---|---|---|---|---|
38 | 10 | 52 | 0.51 | — | 40 | 14 | 38 | 0.45 |
In matching with multiple controls, each treated subject is matched to at least one, and possibly more than one, control:

- a fixed ratio: match each treated subject to the same number of controls; for instance, pair matching in a ratio of 1-to-1, or matching in a ratio of 1-to-2.
- a variable ratio: allow the number of controls to vary from one treated subject to another. In the welder data above, a variable ratio allows matching with multiple controls and, in particular, using all of the controls.

The decision to match in a fixed or a variable ratio affects both the construction of the match and the subsequent analysis. Matching with a variable number of controls is slightly more complex: an optimal match with variable controls minimizes the total distance between treated subjects and controls inside matched sets.

The principal advantage of matching in a fixed ratio is that it avoids direct adjustment, that is, giving unequal weights to observations. This advantage weighs heavily when potential controls are abundant and the biases that must be removed are not large.

There are several advantages to matching with a variable ratio (Paul R. Rosenbaum 1989):
Finding an optimal match with variable controls is equivalent to solving a particular minimum-cost flow problem in a network (Paul R. Rosenbaum 1989).
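At small scale, the same optimization can be carried out by brute force. Here is a stdlib-only Python sketch with hypothetical propensity scores (`optmatch` instead solves it as a minimum-cost network flow):

```python
# Matching with a variable ratio, by brute force: assign every control to
# some treated subject (each treated subject gets >= 1 control) so as to
# minimize the total PS distance within matched sets. PSs are hypothetical.
from itertools import product

ps_treated = [0.60, 0.20]
ps_control = [0.63, 0.55, 0.19]

best_cost, best_assign = None, None
for assign in product(range(len(ps_treated)), repeat=len(ps_control)):
    if set(assign) != set(range(len(ps_treated))):
        continue  # every treated subject needs at least one control
    cost = sum(abs(ps_treated[t] - ps_control[c]) for c, t in enumerate(assign))
    if best_cost is None or cost < best_cost:
        best_cost, best_assign = cost, assign

print("control -> treated:", best_assign)  # one 1:2 set and one 1:1 set
print("total distance:", round(best_cost, 2))
```

With these numbers, the optimum gives the first treated subject two controls (0.63 and 0.55) and the second one control (0.19), which no fixed 1:1 or 1:2 ratio could achieve while using all three controls.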
In full matching, besides matching with a variable number of controls (a treated subject may be matched to one or more controls), the reverse situation is also permitted: one control may be matched to several treated subjects (P. R. Rosenbaum 1991).
As in matching with variable controls, summary statistics for the control group in full matching must be directly adjusted.
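The needed direct adjustment can be illustrated with a toy calculation (hypothetical matched sets and DPC responses): each matched set, not each control, contributes equally to the adjusted control mean.

```python
# Direct adjustment after matching with a variable number of controls:
# weight each matched set equally, so a control in a 1:3 set carries one
# third the weight of the lone control in a 1:1 set. Data are hypothetical.
matched_sets = {               # matched set id -> control DPC responses
    "m1": [1.08],              # 1 treated : 1 control
    "m2": [0.42, 0.55, 0.62],  # 1 treated : 3 controls
    "m3": [0.98, 2.20],        # 1 treated : 2 controls
}

n_controls = sum(len(s) for s in matched_sets.values())

# The unweighted mean treats every control equally: dominated by the big set.
raw_mean = sum(v for s in matched_sets.values() for v in s) / n_controls

# The directly adjusted mean treats every matched set equally.
adj_mean = sum(sum(s) / len(s) for s in matched_sets.values()) / len(matched_sets)

print(f"raw control mean:      {raw_mean:.3f}")
print(f"adjusted control mean: {adj_mean:.3f}")
```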
Full matching can be vastly better than pair matching or matching with a variable number of controls: the matched sets can be quite homogeneous, although the quite unequal set sizes can lead to some inefficiency.
In one specific sense, an optimal full matching is an optimal design for an observational study (P. R. Rosenbaum 1991). Specifically, define a stratification to be a partitioning of the subjects into groups or strata based on the covariates with the one requirement that each stratum must contain at least one treated subject and at least one control. The quality of a stratification might reasonably be judged by a weighted average of all the within strata distances between treated subjects and controls.
No matter which of these weightings is used, no matter what distance is used, there is always a full matching that minimizes this weighted average distance (P. R. Rosenbaum 1991).
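That quality measure is easy to state computationally. A stdlib-only Python sketch, with hypothetical PSs, |PS difference| as the distance, and strata weighted by their size:

```python
# Quality of a stratification / full matching: a weighted average over
# strata of the mean treated-vs-control distance inside each stratum.
# PS values are hypothetical; the distance is |PS difference|.
strata = [
    # (treated PSs, control PSs) — full matching lets either side vary
    ([0.65], [0.64, 0.63]),   # 1 treated : 2 controls
    ([0.57, 0.60], [0.57]),   # 2 treated : 1 control
    ([0.19], [0.20]),         # 1 treated : 1 control
]

def stratum_distance(treated, controls):
    """Mean |PS difference| over all treated-control pairs in a stratum."""
    pairs = [abs(t - c) for t in treated for c in controls]
    return sum(pairs) / len(pairs)

sizes = [len(t) + len(c) for t, c in strata]
dists = [stratum_distance(t, c) for t, c in strata]
quality = sum(n * d for n, d in zip(sizes, dists)) / sum(sizes)
print(f"weighted average within-stratum distance: {quality:.5f}")
```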
```r
m <- fullmatch(dmat, data=gen.tox, min.controls = 1/7, max.controls = 7)
length(m)
```

```
[1] 47
```

```r
sum(matched(m))
```

```
[1] 47
```

```r
# Housekeeping
im <- as.integer(m)
gen.tox.fm <- cbind(gen.tox, im)
g.t.m <- gen.tox[matched(m),]

Welders <- gen.tox.fm %>%
  filter(Group=="Welders") %>%
  arrange(im) %>%
  dplyr::select(Age, Race, Smoker, ps, im)
# Welders$Pair <- paste0("Pair ", 1:21)
Welders <- Welders[, c("Age","Race","Smoker","ps","im")]

fmatched.controls <- gen.tox.fm %>%
  filter(Group=="Controls") %>%
  arrange(im) %>%
  dplyr::select(Age, Race, Smoker, ps, im)

kable(list(Welders, fmatched.controls), caption="Optimal full match using the Mahalanobis distance within PS calipers") %>%
  kable_styling(bootstrap_options = "striped", full_width = F)
```
(to be continued)
Matching on one variable, the propensity score, tends to produce treated and control groups that, in aggregate, are balanced with respect to observed covariates.

\(\rightarrow\) Cons: individual pairs that are close on the propensity score may differ widely on specific covariates.

\(\rightarrow\) Solution: to form closer pairs, use a distance that penalizes large differences on the propensity score, then find individual pairs that are as close as possible.

Pairs or matched sets are constructed using an optimization algorithm. Matching with variable controls and full matching combine elements of matching with elements of direct adjustment. Full matching can often produce closer matches than pair matching.
Chapter 9, "Basic Tools of Multivariate Matching," from Rosenbaum, Paul R. Design of Observational Studies, 2nd ed., Springer International Publishing, 2020, https://doi.org/10.1007/978-3-030-46405-9.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/hai-mn/hai-mn.github.io, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Nguyen (2022, April 19). HaiBiostat: Basic Tools of Multivariate Matching. Retrieved from https://hai-mn.github.io/posts/2022-04-19-Basic Tools of Multivariate Matching/
BibTeX citation
```
@misc{nguyen2022basic,
  author = {Nguyen, Hai},
  title = {HaiBiostat: Basic Tools of Multivariate Matching},
  url = {https://hai-mn.github.io/posts/2022-04-19-Basic Tools of Multivariate Matching/},
  year = {2022}
}
```